Abstract

This paper focuses on recovering an unknown vector $\beta$ from the
noisy data $Y = X\beta + \sigma\xi$, where $X$ is a known
$n \times p$ matrix, $\xi$ is a standard white Gaussian noise, and
$\sigma$ is an unknown noise level. In order to estimate $\beta$, a
spectral regularization method is used, and our goal is to choose its
regularization parameter with the help of the data $Y$. In this paper,
we deal solely with regularization methods based on the so-called
ordered smoothers (see [13]) and extend the oracle inequality from [11]
to the case where the noise level is unknown.

1 Introduction and main results

This paper deals with recovering an unknown vector
$\beta \in \mathbb{R}^p$ from the noisy data

\[ Y = X\beta + \sigma\xi, \]

where $X$ is a known $n \times p$ matrix with $n \ge p$,
$\xi = (\xi(1), \ldots, \xi(n))^\top$ is a standard white Gaussian noise
($\mathbf{E}\xi(k) = 0$, $\mathbf{E}\xi^2(k) = 1$, $k = 1, \ldots, n$),
and $\sigma$ is an unknown noise level.

In spite of its simplicity, this mathematical model plays an important
role in solving practical inverse problems like gravity problems (see,
e.g., [3]), tomography inverse problems [12], and many others. As a
rule, in inverse problems $n$ and $p$ are very large, and therefore the
main goal in this paper is to propose an approach suitable for
$n = \infty$, $p = \infty$, severely ill-posed matrices $X^\top X$, and
the unknown noise level.

We begin with the standard maximum likelihood estimate

\[ \hat\beta_0 = \arg\min_{\beta} \|Y - X\beta\|^2
   = (X^\top X)^{-1} X^\top Y, \]

where $\|z\|^2 = \sum_{k=1}^{n} z^2(k)$. It is well known and easy to
check that

\[ \mathbf{E}(\hat\beta_0 - \beta)(\hat\beta_0 - \beta)^\top
   = \sigma^2 (X^\top X)^{-1}, \]

and thus the mean square risk of $\hat\beta_0$ is computed as follows:

\[ \mathbf{E}\|\hat\beta_0 - \beta\|^2
   = \sigma^2 \sum_{k=1}^{p} \lambda^{-1}(k), \tag{1.1} \]

where $\lambda(k)$ and $\psi_k \in \mathbb{R}^p$ here and below are the
eigenvalues and the eigenvectors of $X^\top X$:

\[ X^\top X \psi_k = \lambda(k)\psi_k. \]

In this paper, it is assumed solely that
$\lambda(1) \ge \lambda(2) \ge \cdots \ge \lambda(p)$. This assumption
together with (1.1) reveals the main difficulty in $\hat\beta_0$: its
risk may be very large when $p$ is large or when $X^\top X$ has a large
condition number.

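The natural idea to improve $\hat\beta_0$ is to suppress large
$\lambda^{-1}(k)$ in (1.1) with the help of a linear smoother. Therefore
we make use of the following family of linear estimates

\[ \hat\beta_\alpha = H_\alpha \hat\beta_0, \quad
   \alpha \in (0, \alpha^\circ], \tag{1.2} \]

where $H_\alpha$, $\alpha \in (0, \alpha^\circ]$, is a family of
$p \times p$ smoothing matrices.

In what follows, we deal with the smoothing matrices admitting the
following representation

\[ H_\alpha = \sum_{k=1}^{p} \mathcal{H}_\alpha[\lambda(k)]\,
   \psi_k \psi_k^\top, \]

where $\mathcal{H}_\alpha(\lambda) : \mathbb{R}^+ \to [0, 1]$ is such
that

\[ \lim_{\alpha \to 0} \mathcal{H}_\alpha(\lambda) = 1, \quad
   \lim_{\lambda \to 0} \mathcal{H}_\alpha(\lambda) = 0. \]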

In the literature (see, e.g., [9]), this method is called spectral
regularization. It covers widely used regularization methods such as
the Tikhonov-Phillips regularization [20], known in the statistical
literature as ridge regression, Landweber's iterations [2], the
$\mu$-method (see, e.g., [9]), and many others.
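As a rough illustration of the filter functions behind these methods,
the sketch below writes three of them as scalar functions of an
eigenvalue. The parametrizations are assumptions chosen only so that the
two limits stated above hold; they are not taken from the paper:
Tikhonov-Phillips as lam / (lam + alpha), spectral cut-off as the
indicator of lam >= alpha, and Landweber with m = ceil(1/alpha)
iterations and a hypothetical step size tau satisfying tau * lam <= 1.

```python
import math

def tikhonov(alpha, lam):
    """Tikhonov-Phillips (ridge) filter, assumed form lam / (lam + alpha)."""
    return lam / (lam + alpha)

def cutoff(alpha, lam):
    """Spectral cut-off filter, assumed form 1{lam >= alpha}."""
    return 1.0 if lam >= alpha else 0.0

def landweber(alpha, lam, tau=1.0):
    """Landweber filter after m = ceil(1/alpha) iterations (the link between
    alpha and m is an assumption); requires 0 <= tau * lam <= 1."""
    m = math.ceil(1.0 / alpha)
    return 1.0 - (1.0 - tau * lam) ** m

# Each filter is close to 1 for small alpha and vanishes at lam = 0.
for f in (tikhonov, cutoff, landweber):
    print(f.__name__, f(1e-6, 0.5), f(0.5, 0.0))
```

All three take values in [0, 1], so each can play the role of
$\mathcal{H}_\alpha(\lambda)$ in the representation above.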
Summarizing, $\beta$ is estimated with the help of the family of linear
estimates $\hat\beta_\alpha$, $\alpha \in (0, \alpha^\circ]$, defined by
(1.2), and our goal is to find, based on the data at hand, the best
estimator within this family. Notice that for given $\alpha$, the mean
square risk of $\hat\beta_\alpha$ is computed as follows:

\[ L_\alpha(\beta) := \mathbf{E}\|\hat\beta_\alpha - \beta\|^2
   = \sum_{k=1}^{p} [1 - h_\alpha(k)]^2 \langle \beta, \psi_k \rangle^2
   + \sigma^2 \sum_{k=1}^{p} \lambda^{-1}(k) h_\alpha^2(k), \tag{1.3} \]

where

\[ h_\alpha(k) = \mathcal{H}_\alpha[\lambda(k)] \quad \text{and} \quad
   \langle \beta, \psi_k \rangle = \sum_{l=1}^{p} \beta(l)\psi_k(l). \]

It is easily seen from (1.3) that the variance of $\hat\beta_\alpha$ is
always smaller than that of $\hat\beta_0$, but $\hat\beta_\alpha$ has a
non-zero bias, and therefore by adjusting $\alpha$ we may improve the
risk of $\hat\beta_0$. This improvement may be very significant when
$\langle \beta, \psi_k \rangle^2$ are small for large $k$.
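The trade-off in (1.3) can be made concrete with a small numerical
sketch. The spectrum, the coefficients and the noise level below are
hypothetical values chosen for illustration only; the function simply
evaluates the two sums in (1.3) for a given vector of filter values.

```python
def risk_terms(h, beta, lams, sigma):
    """Return (squared bias, variance) of the linear estimate, as in (1.3).

    h[k]    : filter value h_alpha(k) in [0, 1]
    beta[k] : coefficient <beta, psi_k>
    lams[k] : eigenvalue lambda(k) > 0 of X^T X
    """
    bias2 = sum((1.0 - hk) ** 2 * bk ** 2 for hk, bk in zip(h, beta))
    var = sigma ** 2 * sum(hk ** 2 / lk for hk, lk in zip(h, lams))
    return bias2, var

# Hypothetical example: polynomially decaying spectrum and coefficients.
p, sigma = 50, 0.1
lams = [k ** -2.0 for k in range(1, p + 1)]
beta = [k ** -2.0 for k in range(1, p + 1)]

no_smoothing = risk_terms([1.0] * p, beta, lams, sigma)   # h = 1: beta_0
keep_first_5 = risk_terms([1.0] * 5 + [0.0] * (p - 5), beta, lams, sigma)
print(sum(no_smoothing), sum(keep_first_5))
```

With these values, keeping only the first five components introduces a
small bias but removes most of the variance, which is the situation
described in the paragraph above.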

In practice, a good choice of the regularizing matrix family $H_\alpha$
is a delicate problem related to the computational complexity of
$\hat\beta_\alpha$. For details, we refer interested readers to [9].

As a rule, practical spectral regularization methods (the spectral
cut-off, the Tikhonov-Phillips regularization, Landweber's iterations)
represent the so-called ordered smoothers [13]. This means that the
family of functions
$\{\mathcal{H}_\alpha(\lambda),\ \alpha \in (0, \alpha^\circ]\}$ is
ordered in the following sense:

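Definition 1 The family of functions
$\{F_\alpha(\lambda),\ \alpha \in A,\ \lambda \in \Lambda \subseteq \mathbb{R}^+\}$
is ordered if:

1. For any given $\alpha \in A$, $F_\alpha(\lambda) : \Lambda \to [0, 1]$
   is a monotone function of $\lambda$.

2. If for some $\alpha_1, \alpha_2 \in A$ and some $\lambda' \in \Lambda$,
   $F_{\alpha_1}(\lambda') < F_{\alpha_2}(\lambda')$, then for all
   $\lambda \in \Lambda$, $F_{\alpha_1}(\lambda) \le F_{\alpha_2}(\lambda)$.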
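Definition 1 can be checked numerically on finite grids, which is a
convenient sanity test for a candidate family (a check on a grid, not a
proof). In the sketch below, the Tikhonov form lam / (lam + alpha) and
the counterexample `crossing` are illustrative assumptions, not objects
defined in the paper.

```python
def is_ordered(family, alphas, lams, tol=1e-12):
    """Check both conditions of Definition 1 on finite grids."""
    lams = sorted(lams)
    curves = [[family(a, l) for l in lams] for a in alphas]
    for c in curves:
        # Condition 1: values in [0, 1] and monotone in lambda.
        if any(v < -tol or v > 1.0 + tol for v in c):
            return False
        diffs = [b - a for a, b in zip(c, c[1:])]
        if any(d > tol for d in diffs) and any(d < -tol for d in diffs):
            return False
    for i, c1 in enumerate(curves):
        for c2 in curves[i + 1:]:
            # Condition 2: two curves of the family never cross.
            below = any(u < v - tol for u, v in zip(c1, c2))
            above = any(u > v + tol for u, v in zip(c1, c2))
            if below and above:
                return False
    return True

def tikhonov(alpha, lam):
    return lam / (lam + alpha)

def crossing(alpha, lam):
    # A constant curve for alpha >= 1 crosses the Tikhonov curves.
    return 0.5 if alpha >= 1.0 else lam / (lam + alpha)

grid = [0.01 * i for i in range(1, 101)]
print(is_ordered(tikhonov, [0.1, 0.5, 1.0], grid))
print(is_ordered(crossing, [0.1, 0.5, 1.0], grid))
```

The second family fails only condition 2: each of its members is
monotone, but the constant member lies above the others for small
$\lambda$ and below them for large $\lambda$.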

The next important question usually arising in practice is related to
the data-driven choice of the regularization parameter $\alpha$. In the
statistical literature, one can find several general approaches to this
problem. We cite here, for instance, the Lepski method, which has been
adapted to inverse problems in [17], [2], [5], and the model selection
technique, which was used in [15].

The approach proposed in this paper is a slight modification of the
unbiased risk estimation. To make the presentation simpler, we begin
with the case where the noise level $\sigma^2$ is known. Intuitively, a
good data-driven regularization parameter should minimize in some sense
the risk $L_\alpha(\beta)$ (see (1.3)). Obviously, the best
regularization parameter minimizing $L_\alpha(\beta)$ cannot be used
since it depends on the unknown parameter of interest $\beta$. However,
the idea of minimization of $L_\alpha(\beta)$ may be put into practice
with the help of the empirical risk minimization principle defining the
regularization parameter as follows:

\[ \hat\alpha = \arg\min_{\alpha} R^\sigma_\alpha[Y, Pen], \tag{1.4} \]

where

\[ R^\sigma_\alpha[Y, Pen] = \|\hat\beta_0 - \hat\beta_\alpha\|^2
   + \sigma^2 Pen(\alpha), \]

and $Pen(\alpha) : (0, \alpha^\circ] \to \mathbb{R}^+$ is a given
function called penalty. The main idea in this approach is to link
$L_\alpha(\beta)$ and $R^\sigma_\alpha[Y, Pen]$. Heuristically, we want
to find a minimal penalty $Pen(\alpha)$ that ensures the following
inequality

\[ L_\alpha(\beta) \lesssim R^\sigma_\alpha[Y, Pen] + C, \tag{1.5} \]

where $C$ is a random variable that does not depend on $\alpha$. It is
convenient to define this constant as follows:

\[ C = -\|\hat\beta_0 - \beta\|^2
   = -\sigma^2 \sum_{k=1}^{p} \lambda^{-1}(k)\,\xi'^2(k). \]

The traditional approach to solving (1.5) is based on the minimization
of the unbiased risk estimate. In this method, the penalty is computed
as a root of the equation

\[ L_\alpha(\beta) = \mathbf{E}\{R^\sigma_\alpha[Y, Pen_u] + C\}. \tag{1.6} \]

One can check with simple algebra that

\[ Pen_u(\alpha) = 2 \sum_{k=1}^{p} \lambda^{-1}(k) h_\alpha(k). \]

The idea of this penalty goes back to [1], and [7] provides some oracle
inequalities related to this approach.
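A minimal sketch of the selection rule (1.4) with the penalty
$Pen_u$ is given below. It works with the coordinates $y(k)$ of
$\hat\beta_0$ in the eigenbasis $\psi_k$, for which
$\|\hat\beta_0 - \hat\beta_\alpha\|^2 = \sum_k [1 - h_\alpha(k)]^2 y^2(k)$.
The grid of $\alpha$ values, the cut-off filter and the simulated data
are hypothetical choices made for illustration only.

```python
import random

def select_alpha_ure(y, lams, sigma, alphas, filt):
    """Minimize ||beta_0 - beta_alpha||^2 + sigma^2 * Pen_u(alpha) over a
    finite grid of alphas; returns (best alpha, criterion value)."""
    best = None
    for a in alphas:
        h = [filt(a, l) for l in lams]
        fit = sum((1.0 - hk) ** 2 * yk ** 2 for hk, yk in zip(h, y))
        pen_u = 2.0 * sum(hk / lk for hk, lk in zip(h, lams))
        crit = fit + sigma ** 2 * pen_u
        if best is None or crit < best[1]:
            best = (a, crit)
    return best

def cutoff(alpha, lam):
    # Assumed spectral cut-off parametrization: keep eigenvalues >= alpha.
    return 1.0 if lam >= alpha else 0.0

# Hypothetical simulated data in the eigenbasis.
random.seed(0)
p, sigma = 30, 0.05
lams = [k ** -2.0 for k in range(1, p + 1)]
beta = [k ** -2.0 for k in range(1, p + 1)]
y = [b + sigma * random.gauss(0.0, 1.0) / l ** 0.5
     for b, l in zip(beta, lams)]
alphas = [2.0 ** -j for j in range(12)]
print(select_alpha_ure(y, lams, sigma, alphas, cutoff))
```

The noise level is treated as known here, in line with the case
considered in this part of the text.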

Another well-known and widely used approach to the data-driven choice
of $\alpha$ is related to the cross validation technique [8]. In the
framework of our statistical model, this method prompts a data-driven
regularization parameter which is close to

\[ \hat\alpha_{CV} = \arg\min_{\alpha}
   \bigl\{ \|Y - X\hat\beta_\alpha\|^2 + \sigma^2 Pen_{CV}(\alpha) \bigr\}, \]

with

\[ Pen_{CV}(\alpha) = 2 \sum_{k=1}^{p} h_\alpha(k). \]

It is well known (see, e.g., [13]) that if the risk of $\hat\beta$ is
measured by $\mathbf{E}\|X\hat\beta - X\beta\|^2$, then this penalty is
nearly optimal and it always works well.
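The two penalties differ only by the weights $\lambda^{-1}(k)$, which a
short sketch makes visible. The filter values and the exponentially
decaying spectrum below are hypothetical and serve only to show how fast
the weighted penalty grows compared with the unweighted one.

```python
import math

def pen_u(h, lams):
    """Unbiased-risk penalty: 2 * sum_k h(k) / lambda(k)."""
    return 2.0 * sum(hk / lk for hk, lk in zip(h, lams))

def pen_cv(h):
    """Cross-validation penalty: 2 * sum_k h(k)."""
    return 2.0 * sum(h)

# Hypothetical severely ill-posed spectrum lambda(k) = exp(-k).
lams = [math.exp(-k) for k in range(1, 11)]
for m in (2, 5, 10):
    h = [1.0] * m + [0.0] * (10 - m)      # keep the first m components
    print(m, pen_cv(h), pen_u(h, lams))
```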

However, the question "Does $\hat\alpha_{CV}$ work well when the risk is
measured by $\mathbf{E}\|\hat\beta - \beta\|^2$?" has a delicate answer
depending on the spectrum of $X^\top X$. To the best of our knowledge,
there are no oracle inequalities controlling the risk of
$\hat\beta_{\hat\alpha_{CV}}$ uniformly in $\beta$. Notice, however,
that one can show with the help of the method for computing minimal
penalties in [1] that if $\lambda(k) \le \exp(-\kappa k)$, then the risk
of this method blows up starting from some $\kappa > 0$.

A similar effect takes place in the unbiased risk estimation. This
happens because the standard deviation of
$R^\sigma_\alpha[Y, Pen_u] + C$ may be very large with respect to the
mean $\mathbf{E}\{R^\sigma_\alpha[Y, Pen_u] + C\}$, and therefore (1.5)
may fail with a high probability.

To improve the above mentioned drawbacks of the unbiased risk
estimation, we define, following [11], the penalty as a minimal root of
the equation

\[ \mathbf{E} \sup_{\alpha \le \alpha^\circ}
   \bigl[ L_\alpha(\beta) - R^\sigma_\alpha[Y, Pen] - C \bigr]_+
   \le C_1 \mathbf{E}
   \bigl[ L_{\alpha^\circ}(\beta) - R^\sigma_{\alpha^\circ}[Y, Pen] - C \bigr]_+ ,
   \tag{1.7} \]

where $[x]_+ = \max\{0, x\}$ and $C_1 > 1$ is a constant. The heuristic
motivation behind this approach is rather transparent. We are looking
for the minimal penalty that balances the excess risks corresponding to
all possible $\alpha \in (0, \alpha^\circ]$. Recall that the excess risk
is defined by the difference between the risk of the estimate and its
penalized empirical risk. Note that in view of (1.5), we can deal solely
with the positive part of the excess risk.

In order to explain heuristically how Equation (1.7) may be solved, we
begin with the spectral representation of the underlying statistical
problem. One can check easily that

\[ y(k) := \langle X^\top Y, \psi_k \rangle / \lambda(k)
   = \langle \beta, \psi_k \rangle + \sigma \xi'(k) / \sqrt{\lambda(k)}, \]

where $\xi'(k)$ are i.i.d. $\mathcal{N}(0, 1)$. With these notations,
$\hat\beta_\alpha$ admits the following representation

\[ \langle \hat\beta_\alpha, \psi_k \rangle = h_\alpha(k) y(k)
   = h_\alpha(k)\beta(k)
   + \sigma h_\alpha(k) \xi'(k) / \sqrt{\lambda(k)}, \]

where $\beta(k) = \langle \beta, \psi_k \rangle$, and therefore

\[ \|\hat\beta_0 - \hat\beta_\alpha\|^2
   = \sum_{k=1}^{p} [1 - h_\alpha(k)]^2 y^2(k), \qquad
   \|\hat\beta_0 - \beta\|^2
   = \sigma^2 \sum_{k=1}^{p} \lambda^{-1}(k)\,\xi'^2(k). \tag{1.8} \]

In what follows, it is assumed that the penalty has the following
structure

\[ Pen(\alpha) = Pen_u(\alpha) + (1 + \gamma) Q(\alpha), \]

where $\gamma$ is a small positive number and $Q(\alpha)$,
$\alpha > 0$, is a positive function of $\alpha$ to be defined later on.
Recall that the first term at the right-hand side is obtained from the
unbiased risk estimation (see Equation (1.6)). With $Pen(\alpha)$ we can
rewrite the excess risk as follows:

\[ \begin{aligned}
   L_\alpha(\beta) - R^\sigma_\alpha[Y, Pen] - C
   ={}& \sigma^2 \sum_{k=1}^{p} \lambda^{-1}(k)
        \bigl[ 2h_\alpha(k) - h_\alpha^2(k) \bigr] (\xi'^2(k) - 1)
        - (1 + \gamma)\sigma^2 Q(\alpha) \\
   &+ 2\sigma \sum_{k=1}^{p} \lambda^{-1/2}(k)
        \bigl[ 2h_\alpha(k) - h_\alpha^2(k) \bigr] \xi'(k)\beta(k).
   \end{aligned} \tag{1.9} \]

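The first idea in solving (1.7) is based on the fact that the cross
term

\[ 2\sigma \sum_{k=1}^{p} \lambda^{-1/2}(k)
   \bigl[ 2h_\alpha(k) - h_\alpha^2(k) \bigr] \xi'(k)\beta(k) \]

is typically small with respect to
$\mathbf{E}\{R^\sigma_\alpha[Y, Pen] + C\}$ (for more details see
Lemma 9 in [11]). With this in mind, omitting the cross term, Equation
(1.7) can be rewritten in the following nearly equivalent form

\[ \mathbf{E} \sup_{\alpha \le \alpha^\circ}
   [\eta_\alpha - (1 + \gamma) Q(\alpha)]_+
   \le C_1 \mathbf{E}
   [\eta_{\alpha^\circ} - (1 + \gamma) Q(\alpha^\circ)]_+
   \asymp D(\alpha^\circ), \tag{1.10} \]

where

\[ \eta_\alpha = \sum_{k=1}^{p} \lambda^{-1}(k)
   \bigl[ 2h_\alpha(k) - h_\alpha^2(k) \bigr] [\xi'^2(k) - 1] \]

and

\[ D(\alpha) = \Bigl\{ 2 \sum_{k=1}^{p} \lambda^{-2}(k)
   \bigl[ 2h_\alpha(k) - h_\alpha^2(k) \bigr]^2 \Bigr\}^{1/2}. \]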
Now we are in a position to compute an approximation of the minimal
root for (1.10). It is clear that $Q(\alpha) \ge Q^+(\alpha)$, where
$Q^+(\alpha)$ is a root of

\[ \mathbf{E}[\eta_\alpha - Q^+(\alpha)]_+ = D(\alpha^\circ). \tag{1.11} \]

To find a feasible solution to (1.11), we make use of the exponential
Chebyshev inequality resulting in

\[ \mathbf{E}[\eta - x]_+^p \le \Gamma(p + 1)\lambda^{-p}
   \exp(-\lambda x)\,\mathbf{E}\exp(\lambda\eta), \tag{1.12} \]

where $\eta$ is a random variable, $\Gamma(\cdot)$ is the gamma
function, and $\lambda > 0$. Therefore we define $Q^+(\alpha)$ as a root
of the equation

\[ \inf_{\lambda} \exp[-\lambda Q^+(\alpha)]\,
   \mathbf{E}\exp(\lambda\eta_\alpha) = D(\alpha^\circ). \]

It is easy to check with simple algebra that

\[ Q^+(\alpha) = 2 D(\alpha)\mu_\alpha \sum_{k=1}^{p}
   \frac{\rho_\alpha^2(k)}{1 - 2\mu_\alpha\rho_\alpha(k)}, \tag{1.13} \]

where $\mu_\alpha$ is a root of the equation

\[ \sum_{k=1}^{p} F[\mu_\alpha\rho_\alpha(k)]
   = \log\frac{D(\alpha)}{D(\alpha^\circ)}, \tag{1.14} \]

and

\[ F(x) = \frac{1}{2}\log(1 - 2x) + x + \frac{2x^2}{1 - 2x}, \qquad
   \rho_\alpha(k) = \sqrt{2}\, D^{-1}(\alpha)\lambda^{-1}(k)
   \bigl[ 2h_\alpha(k) - h_\alpha^2(k) \bigr]. \tag{1.15} \]
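Equations (1.13)-(1.15) suggest a simple numerical recipe: since $F$ is
increasing on $[0, 1/2)$ with $F(0) = 0$, the root $\mu_\alpha$ of
(1.14) can be found by bisection. The sketch below follows the formulas
as transcribed here; the helper names, the closed form used for
$D(\alpha)$ (the standard deviation of $\eta_\alpha$) and the convention
of returning 0 when $D(\alpha) \le D(\alpha^\circ)$ are assumptions made
for illustration.

```python
import math

def F(x):
    """F from (1.15); defined for x < 1/2."""
    return 0.5 * math.log(1.0 - 2.0 * x) + x + 2.0 * x * x / (1.0 - 2.0 * x)

def rho_values(h, lams):
    """Return (D(alpha), [rho_alpha(k)]) for filter values h."""
    w = [(2.0 * hk - hk * hk) / lk for hk, lk in zip(h, lams)]
    d = math.sqrt(2.0 * sum(wk * wk for wk in w))
    return d, [math.sqrt(2.0) * wk / d for wk in w]

def solve_mu(rho, target, iters=200):
    """Bisection for the root of sum_k F(mu * rho_k) = target > 0."""
    lo, hi = 0.0, 0.5 / max(rho)      # the sum diverges as mu -> hi
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if sum(F(mid * r) for r in rho) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def q_plus(h, lams, d_ref):
    """Q^+(alpha) from (1.13); d_ref plays the role of D(alpha°).
    Returns 0.0 if D(alpha) <= d_ref (assumed convention)."""
    d, rho = rho_values(h, lams)
    if d <= d_ref:
        return 0.0
    mu = solve_mu(rho, math.log(d / d_ref))
    return 2.0 * d * mu * sum(r * r / (1.0 - 2.0 * mu * r) for r in rho)

# Hypothetical three-component example.
lams = [1.0, 0.5, 0.25]
print(q_plus([1.0, 1.0, 1.0], lams, d_ref=math.sqrt(2.0)))
```

By construction $\sum_k \rho_\alpha^2(k) = 1$, which gives a quick check
that the weights were computed consistently.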

The next result (see also Theorem 1 in [11]) shows that $Q^+(\alpha)$ is
a nearly optimal solution to (1.10).



Proposition 1 For any 7 > g > 



E sup 

a<a° L 



77a-(l + 7 )Q + (a) 



1+9 < CD 1+ i{a°) 



(7 - qf 

where here and throughout the paper C denotes a generic constant. 
Let us now turn to the case where $\sigma$ is unknown. To compute the
data-driven regularization parameter in this situation, we replace
$\sigma^2$ in $R^\sigma_\alpha[Y, Pen]$ by the standard variance
estimator

\[ \hat\sigma^2_\alpha
   = \frac{\|Y - X\hat\beta_\alpha\|^2}{\|1 - h_\alpha\|^2}. \]

Thus we arrive at the following approximation of the empirical risk

\[ R_\alpha[Y, Pen] := \|\hat\beta_0 - \hat\beta_\alpha\|^2
   + \frac{\|Y - X\hat\beta_\alpha\|^2}{\|1 - h_\alpha\|^2}\,
   Pen(\alpha), \]

and the data-driven regularization parameter is computed now as follows:

\[ \hat\alpha = \arg\min_{\alpha_\circ \le \alpha \le \alpha^\circ}
   R_\alpha[Y, Pen]. \]
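The selection rule with the plug-in variance estimate can be sketched
in the eigenbasis as follows. This is an illustration under explicit
assumptions that are not stated in the text: $n = p$ (so that
$\|Y - X\hat\beta_\alpha\|^2 = \sum_k \lambda(k)[1 - h_\alpha(k)]^2 y^2(k)$),
$\|1 - h_\alpha\|^2 = \sum_k [1 - h_\alpha(k)]^2$, a cut-off filter, and
$Pen_u$ used in place of the full penalty.

```python
def pen_u(h, lams):
    return 2.0 * sum(hk / lk for hk, lk in zip(h, lams))

def cutoff(alpha, lam):
    # Assumed spectral cut-off parametrization: keep eigenvalues >= alpha.
    return 1.0 if lam >= alpha else 0.0

def select_alpha_unknown_sigma(y, lams, alphas, filt, pen):
    """Minimize the empirical risk with sigma^2 replaced by its estimate.
    Returns (alpha, criterion, estimated sigma^2), or None if no alpha
    on the grid gives a well-defined variance estimate."""
    best = None
    for a in alphas:
        h = [filt(a, l) for l in lams]
        denom = sum((1.0 - hk) ** 2 for hk in h)        # ||1 - h_alpha||^2
        if denom == 0.0:
            continue  # estimate undefined: alpha must stay above alpha_o
        resid = sum(lk * (1.0 - hk) ** 2 * yk ** 2
                    for hk, lk, yk in zip(h, lams, y))  # ||Y - X beta_alpha||^2
        sigma2_hat = resid / denom
        fit = sum((1.0 - hk) ** 2 * yk ** 2 for hk, yk in zip(h, y))
        crit = fit + sigma2_hat * pen(h, lams)
        if best is None or crit < best[1]:
            best = (a, crit, sigma2_hat)
    return best

print(select_alpha_unknown_sigma([1.0, 2.0], [1.0, 0.25],
                                 [0.5, 0.1], cutoff, pen_u))
```

The skipped grid points show why a lower bound on $\alpha$ is needed:
when almost nothing is smoothed out, too few residual components remain
to estimate the noise level.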

Notice that in contrast to the case of known $\sigma$, it is assumed
that $\alpha$ is bounded from below by $\alpha_\circ$. This constraint
ensures that with a high probability
$[\sigma^2 - \hat\sigma^2_\alpha]_+ \le \sigma^2/2$ uniformly in
$\beta \in \mathbb{R}^p$. Unfortunately, when this inequality fails we
cannot control correctly the risk of $\hat\beta_{\hat\alpha}$, since it
may blow up (see [1] for a similar phenomenon in model selection). So,
to avoid the blowup, we need a relatively good estimate of $\sigma$, or
equivalently, large $\|1 - h_\alpha\|^2$.

Stress also that since $\alpha_\circ$ cannot depend on $\sigma$, we
would like to have $\alpha_\circ$ as small as possible to be sure that
the method works for small noise levels. From a mathematical viewpoint,
this means that we need a relatively good upper bound for
$\mathbf{E}|\hat\sigma^2_{\hat\alpha} - \sigma^2|\,Pen(\hat\alpha)$.
Roughly speaking, we have to check that with a high probability

\[ |\sigma^2 - \hat\sigma^2_{\hat\alpha}|\,Pen(\hat\alpha)
   \le \sigma^2 Pen(\hat\alpha). \]

The main difficulty in proving this inequality is related to the fact
that the random variables $\sigma^2 - \hat\sigma^2_{\hat\alpha}$ and
$Pen(\hat\alpha)$ are dependent. To overcome this difficulty we make use
of the law of the iterated logarithm for
$\sigma^2 - \hat\sigma^2_\alpha$ combined with a generalization of the
Holder inequality (see Lemmas 4 and 3 below). To carry out this
approach, we need the following additional condition: there exists a
positive constant $C_2$ such that for all $\alpha \in (0, \alpha^\circ]$

\[ \|h_\alpha\|_\lambda^2 \ge C_2 \sum_{k=1}^{p}
   \lambda^{-1}(k) h_\alpha(k), \tag{1.16} \]

I'^IK +max M^) >C 2 D(a), (1.17) 



log[D{a)/D(a°)} k \{k) 



S 



where

\[ \|h_\alpha\|_\lambda^2 = \sum_{k=1}^{p} \lambda^{-1}(k) h_\alpha^2(k). \]



Denote for brevity 

def 1 



*(a ,a°) 



\l-h n 



log log 1 + 



|l-^ol| 2 

\l-h n °p 



1/2 



+ log 1 



Pen(a ) 
Pen(a°] 



The following theorem controls the risk of $\hat\beta_{\hat\alpha}$ via
the penalized oracle risk defined by

\[ r(\beta) := \inf_{\alpha \in [\alpha_\circ, \alpha^\circ]}
   R_\alpha(\beta), \]

where 



R a (P) = Ep{R a [Y,Pen}+C}=L a (P) + (l + 1 )a 2 Q + (a) 



+ 



11 na|1 k=i 



Theorem 1 Let
$Pen(\alpha) = 2\sum_{k=1}^{p} \lambda^{-1}(k) h_\alpha(k)
+ (1 + \gamma) Q^+(\alpha)$ with $Q^+(\alpha)$ defined by
(1.13)-(1.15) and suppose (1.16)-(1.17) hold. Then, uniformly in
$\beta \in \mathbb{R}^p$,



+ 



1 + C7*(a ,a°) + C71og- 1 / 2 
Ccr^a ) 



r(/3) 



-K 



a 2 D{a° 
r(/3) 



r(/3) 



1 



a 2 jD(a°) 7 



(1.18) 



[l-C*(a ,a°)/ 7 ] + V7 
where 1Z(x) = x/log(x). 

Notice that Equation 1 1 , 1 81 can be rewritten in the following concise form 

r(/3) 



1 + C*(a ,a°) + * C 



where ^ ao is a bounded function such that 



r(/?), (1.19) 



lim 1V, 7 (x) = 0. 



(1.20) 
The statistical sense of (1.19) is rather transparent: this equation
shows that in typical nonparametric situations the method works like
the ideal penalized oracle with the risk $r(\beta)$. The typical
nonparametric situation means that

• $p$ is large, so, for properly chosen $\alpha_\circ$,
  $\Psi(\alpha_\circ, \alpha^\circ)$ is small,

• the vector
  $(\langle \beta, \psi_1 \rangle, \ldots, \langle \beta, \psi_p \rangle)^\top$
  contains many significant components, and thus
  $r(\beta) \gg \sigma^2 D(\alpha^\circ)$.

These assumptions are typical in the minimax estimation, where it is
assumed that $\beta$ belongs to an ellipsoid. Notice that with the help
of (1.19)-(1.20) one can check relatively easily that for a properly
chosen spectral regularization, $\hat\beta_{\hat\alpha}$ is the
asymptotically minimax estimate up to a constant (see for
details p] and [IS]). 

We finish this section with a short discussion of Conditions
(1.16)-(1.17). Equation (1.16) means that $h_\alpha(k)$ vanishes rather
rapidly for large $k$. This is always true for the spectral cut-off
method ($h_\alpha(k) = \mathbf{1}\{\alpha\lambda(k) \ge 1\}$). Indeed,
if $\lambda(k) \asymp k^{-p}$ with some $p > 0$, then

\[ \|h_\alpha\|_\lambda^2 \asymp \alpha^{-p-1}, \qquad
   D(\alpha) \asymp \alpha^{-p-1/2}, \]

and it is seen easily that (1.17) is fulfilled. Assume now that
$X^\top X$ is severely ill-posed, i.e.,
$\lambda(k) \asymp \exp(-\kappa k)$ with $\kappa > 0$. Then

\[ \max_k \lambda^{-1}(k) h_\alpha(k) \asymp \exp(\kappa/\alpha)
   \quad \text{and} \quad
   D(\alpha) \asymp \kappa^{-1/2}\exp(\kappa/\alpha). \]

Therefore (1.17) holds with $C_2 = \kappa^{-1/2}$.

2 Proofs 

2.1 Ordered processes and their basic properties 

The main results in this paper are based on a general fact which is
similar to Dudley's entropy bound (see, e.g., [21]). Let $\zeta_t$ be a
separable zero mean random process on $\mathbb{R}^+$. Denote for brevity

\[ \Delta^\zeta(t_1, t_2) = \zeta_{t_1} - \zeta_{t_2}. \]

The following fact (see Lemma 1 in [H] ) plays a cornerstone role in the proof 
of Proposition Q] and Theorem [U 
Proposition 2 Let $v_u^2$, $u \in \mathbb{R}^+$, be a continuous
strictly increasing function with $v_0 = 0$. Then for any $\lambda > 0$,

, „ l\ A C (s,t)l log(2)v/2 
logEexp< A max ^-L K < BWV 



1 o<s<t a t J y/2 - 1 

A<( S ', S ) 



+ max max logEexp<zA 



0<s'<s<t| 2 |<^/(^2_i) " \ A t '(s / ,s)j' 



where A v (s',s) = \/\v% — v 2 ,\. 



Definition 2 A zero mean process $\zeta_t$, $t \in \mathbb{R}^+$, is
called ordered if there exists a continuous strictly monotone function
$v_t^2$, $t \in \mathbb{R}^+$, and some $\lambda > 0$ such that

AC(s',s) 



sup E exp 



A 



A v (s',s) 



< oo. 



The next two propositions (see Lemmas 2 and 3 in [11]) show that the
ordered process $\zeta_t$ can be controlled by the deterministic
function $v_t$.

Proposition 3 Let $\zeta_t$ be an ordered process with $\zeta_0 = 0$.
Then there exists a constant $C(q', q)$ such that for all
$1 < q', q \le 2$, uniformly in $z > 0$,

C(q',q) 

j<,t - zv iU - 

t>0 

where [x} + = max(0, x). 



E sup J < zgV(g _ 1} , 



Proposition 4 Assume that there exists a monotone function $v_t$,
$t \ge 0$, such that a random process $\zeta_t$, $t \in \mathbb{R}^+$,
satisfies

Esup[C t -^*<^^, 

for any $z > 0$ and some $q' > 1$, $q > 1$. Then there exists a constant
$C'$ such that for any random variable $\tau \in \mathbb{R}^+$ the
following inequality holds

[V\(r\ q '] 1/9 ' <C'[Bvf] 1/{qq,) . 

In what follows, we focus on typical ordered processes related to the
empirical risk. The following two propositions (see Lemmas 4 and 5 in
[11]) are essential in controlling the cross term

\[ \sigma \sum_{k=1}^{p} \lambda^{-1/2}(k)
   \bigl[ 2h_{\hat\alpha}(k) - h_{\hat\alpha}^2(k) \bigr]
   \xi'(k)\beta(k) \]

in the case where $\hat\alpha$ is a random variable depending on
$\xi'(k)$, $k = 1, \ldots, p$.
Proposition 5 For any given a > and any z > 0,
E sup \^[h a (k)-h a (k)]b(k)?(Js) 



0<a<a° 



fc=l 



■fc=l 
Proposition 6 Let $\alpha$ be a given smoothing parameter. Then for any
$p \in [1, 2)$, there exists a constant $C(p)$ so that for any
data-driven smoothing parameter $\hat\alpha$,



E 



E[ft4(*)-Aa(*)]A- 1/2 (^(^(fc) 



fc=l 
< 



C(p) {E max A -1 2 - /i s (A;)] 2 /3 2 (A;) 



"IP/2 



L fc=l 



+ 



C^jmaxA-^A;)^!^)} 2 ' 2 E - h & (k)) 2 p 2 {k) 



k=l 



p/2 



In order to obtain oracle inequalities in the case where the noise
variance is unknown, we will need the following lemma generalizing
Proposition 3.

Lemma 1 Let 

Ub) = J2[l-h a (kM'(k)b(k), v 2 a (b) = YM-Kik)fb 2 {k), K = 2 

k=i k=i ^ 2 l ) 

Then uniformly in b £ M. p 

Eexp{ sup [C a (b) - Kv 2 a (b)]} < C. 



Proof. Since $h_\alpha(\cdot)$, $\alpha > 0$, is a family of ordered
functions, it is not difficult to check that



nCAb)-Ub)?<\v 2 Ab)-vl( 
Indeed, we can rewrite (|2.ip in the following equivalent form 

BCaib)Ca(b)>wm{vl,(b),v 2 a (b)}. 



(2.1) 
Assume for definiteness that $h_\alpha(k) \ge h_{\alpha'}(k)$,
$k = 1, 2, \ldots, p$. Then $1 - h_\alpha(k) \le 1 - h_{\alpha'}(k)$,
$k = 1, 2, \ldots, p$, and we get



fc=l 



fc=i 



thus proving (|2.1|) . 

Since Ca(b) is a Gaussian process, we obtain by ([2.1 



logEexHA )<* (2.2) 



y/K(b)-vl(b)\ 



2 



We may assume without loss of generality that $v_\alpha^2(b)$ is a
continuous function in $\alpha \in \mathbb{R}^+$. Then let us fix some
$\epsilon > 0$ and define $\alpha_l \in \mathbb{R}^+$ as roots of the
equations

\[ v_{\alpha_l}^2(b) = (1 + \epsilon)^l, \quad l \ge 0. \]

Since $v_\alpha^2(b) \le \sum_{k=1}^{p} b^2(k)$, the set of $\alpha_l$
is always finite but it may be empty.

Let $\alpha^*$ be a root of the equation $v_{\alpha^*}^2(b) = 1$. Then
by Proposition 2 and (2.2) we obtain
Eexp{max[Ca(&)-tfi£(&)])

< Eexp{max[C a (6) - ) + Eexpjmax [C a (6) - Kv 2 a (b)]\ 

< Eexp{maxC Q (6)) + VEexpj max [&(&) - Kv 2 (&)] } 



i>0 



<C + y^Eexp< max 

l>0 K 



< C + Cj^exp 

l>0 

< C + C exp 



«>0 



if 



(^/2-l) 2 1 + e 
= C + C y exp( -(1 + e)'- 1 t^-t ) . 



This equation with $\epsilon = 0.5$ completes the proof. ■
2.2 Recovering the noise variance 

In this section, we focus on basic probabilistic properties of the
variance estimator

\[ \hat\sigma^2_{\hat\alpha}
   = \frac{\|Y - X\hat\beta_{\hat\alpha}\|^2}{\|1 - h_{\hat\alpha}\|^2} \]

in the case where $\hat\alpha$ is a data-driven smoothing parameter. We
begin with a simple auxiliary fact.

Lemma 2 Let $\eta$ and $\eta'$ be nonnegative random variables. Then the
following inequality



2 9 ~ 1 A <? f rl 

E " V£ (2T^ E "' log T + W 

+ - -Er/log^ 1+E 



V 

ex P( - 



(2.3) 



(2 - q)i 

holds for any $\lambda > 0$ and $q \in (1, 2)$.

Proof. Consider the following function

\[ F(z, y) := \max_{x \ge 0}
   \bigl\{ x^q y - z[\exp(x) - 1 - x] \bigr\}. \]



+ qXmrj' 
Differentiating $x^q y - z[\exp(x) - 1 - x]$ in $x$, it is easy to check
that

\[ F(z, y) = x_*^q y - z[\exp(x_*) - 1 - x_*] \le x_*^q y, \]

where $x_*$ is a root of the equation

\[ x_* = \log\Bigl( 1 + \frac{q y x_*^{q-1}}{z} \Bigr). \]



Since log(x) is convex, it is clear 



log 1 + 



qyxi 



< log 1 + 



qy 



+ 1 + 



qy\ 1 q(q-i)y 



( X * - 1). 



Therefore x* < x* , where x* is a root of the following linear equation 



x * = log( 1 + — I + I 1 + 



qy\ l q{q-^)y 



(x* - 1). 



Since q > 1, with a little algebra we get 



x* < 1 + 



qy 



1 g(2-g)y 



-i 



log I 1 + — I < 

W 2 - g 



1 



log 1 + 



qy 



thus arriving at the following upper bound 



F(z,y) < 



log«(l + ^ 



(2 - g)^ 

Now we are in a position to finish the proof. Notice that for any A > 



, ■n 



< max< rl 

x>0 1 



5) 




1 - 


7/" 






A 












2; 


exp 





A 



F(z,r/), 



and therefore 



Er/'ri q <\ q {EF(z,r]') + zE 
1 



exp 



<A 9 



(2 - g)< 



Er/log 9 1 + 



1 
A 



exp 



A 



A 
Next, substituting in the above equation



z = qEr/l E 



cxp 



we obtain 
Er/r/ 9 < A 9 



(2 - q)i 



cxp 



A 



+ oEr/ 



(2.4) 



Finally, applying the following inequality

\[ \log^q(1 + xy) \le [\log(1 + y) + \log(1 + x)]^q
   \le 2^{q-1}\log^q(1 + x) + 2^{q-1}\log^q(1 + y),
   \quad x, y \ge 0, \]

we get 



Er/log^l + ^E 



exp 



A 



+ 2' ? - 1 log 9 h +E 



exp(- 



and combining this equation with (2.4), we finish the proof of (2.3). ■

Lemma 3 Let $\eta$ be a nonnegative sub-Gaussian random variable, i.e.,
such that for all $\lambda > 0$ and some $S > 0$

\[ \mathbf{E}\exp(\eta/\lambda) \le C\exp(S^2/\lambda^2). \]

Then for any $q \in [1, 2)$
Er/V 



1/9 CS 

< 



2-9 



Er/ 9 log 9 1 1 + 



Er/ 



'9 



1/9 



(2.5) 



(2.6) 



Proof. Replacing rj in (|2.3p by rj q and substituting (|2.5p in (|2.3p . we 
get with A = S 



09-1 eg 

En'V < -Er/ 9 log 9 

' ' - (2 - q)l ' B 



1 + 



rf 



Er?"? 



99-1 qq 



Let F(x) = xlog 9 (l + x). It is clear that 



F'(x) = log 9 (l +x) + 



qx log 9 x (l + x) 



1 + x 
is increasing in $x$ and therefore $F(x)$ is convex. Therefore by
Jensen's inequality

> log(2)Er/<?, 



1 + J — 

Brfi 



Eif log 9 
and thus, we arrive at (]2.6p , ■ 
Lemma 4 Let 

c a =j2[i-h a (k)f[i-e(k)] 

and 



y own h V»n i .i JI(l-V) 2 ll 2 ex P (2) 
S a = 2||(1 - h a ) ||^ /log log u 2 2 . 



T/ten /or any s € (1, 2], 



C a ~s£ a . 1 , C ( (3 - sfx 2 

" 1 SU P UTi t — vjTT — x I — 7 77? ex P 



.q<q° IK-l - h a ) 2 \\ J (s-1) 3 1 16 
Proof. For some e > define a^, > 0, as roots of equations 

ii(i-/oY = (i+*ri(i-MT. 

Then, denoting for brevity 

G fc+ i(x) = sS afc+1 + x||(l - n afc+1 ) 2 ||, 

we obtain 

a<a° ||U - /ia) 2 || ~ ~ ^ Ue^,^] 11(1 ~ K) 2 \ 



P \ SU P TT71 T^2iT - x f - P 1 SU P TTm T^2iT - x 

U<a° [|(1 — ^«) 2 [| J ~ Ue[a fc+ i,a*] IK 1 ~ h a) W 

oo ✓ 

<Y, P \ SU P Ca>G fc+1 (x) 
fc=0 la6[afc+l. a fe] 

OO f 

<E P |Ca fc+1 >[l-/(e)]G fc+ i(x) 

fc=0 ^ 

+ f>{ sup [C a -Ca fe+1 ]>/(e)G fe+1 (x)) 

fc=0 <ote[ot h+1 ,a k ] J 

where /(e) will be chosen later on. 



(2.7) 
Since $\log(1 + x) \ge x - x^2/2$, $x \ge 0$, then for any
$\lambda > 0$

\[ \mathbf{E}\exp(\lambda\zeta_\alpha)
   \le \exp\bigl[ \lambda^2 \|(1 - h_\alpha)^2\|^2 \bigr], \tag{2.8} \]

and by the exponential Chebyshev inequality we get

[l-/(e)] 2 G 2 +1 (*)' 



P {Ca k+1 > [l-f(e)]G k+1 (x) <exp 



< exp<^ -s 2 [l - /(e)] 2 logp + 1) log(l + e 



4||(l-^ fc+1 )2||2 
M [1-/W] 2 X 2 



(2.9) 



To bound from above the last term in Equation (12. 7|) . we make use of 
that 2h a (k) - h 2 a (k) is a family of ordered functions, and thus (see f)2. 1 [) ) 

||(i - h ak f - (l - V +1 ) 2 II 2 < 11(1 - K k ?f - ||(i - V. +1 ) 2 II 2 . 

Similarly to flZS]) 

Eexp{A[C afc - Ca k+1 }} < exp[A 2 ||(l - K k f - (1 - K k+1 ff] . 

Therefore with the help of Proposition [2] and the exponential Tchebychev 
inequality we obtain 
<^ sup [Co - Ca k+1 ] > f{e)G k+ i{x) \ < minexp<^ -Xf(e)G k+ i(x)

Ut+i<a<ttfc J A>0 I 



21121 



4[||(l-^ fc ) 2 H 2 -||(l-^ fc+1 ) 

- UeXP \ 8[||(l-V) 2 ll 2 -H(l-^ +1 ) 2 ll 2 ]J 
= Cexpj- ^" 1 ^ 2 ^ logp + 1) log(l + e)] 

(V2-1)W(6) 
8e 

(2.10) 

Now we choose $f(\epsilon)$ to balance the exponents at the right-hand
sides in (2.9) and (2.10), thus arriving at the following equation for
this function

(V2-l) 2 / 2 (6) M ,,,. a 



2e 

This yields 



[i - /(«)]- 



>/2- 1 + v 7 ^' 
With this /(e) and with (p7H2~10|) we get 

P J sup ,.C--»g-„ > 4 <- - 'M1 V / 4 ' 



„<T- ll(i - M 2 II J - e * I i 1 -/(.)]»{ s ![i_/( £ )]2_i} + - 

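Finally, choosing $\epsilon$ as a root of $f(\epsilon) = (s - 1)/2$, we
finish the proof. ■

We summarize the main properties of the variance estimator in the
following lemma.

Lemma 5 For any $q \in (1, 2)$

\[ \mathbf{E}\bigl\{ [\sigma^2 - \hat\sigma^2_{\hat\alpha}]_+
   \sigma^{-2} Pen(\hat\alpha) \bigr\}^q
   \le C(2 - q)^{-q}\,\Psi^q(\alpha_\circ, \alpha^\circ)\,
   \mathbf{E}[Pen(\hat\alpha)]^q. \]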
Proof. By (1.8) we obtain

2(7 



! 'Q: | 



k=l 



\l-h* 



Y,[l-h & (k)} 2 e(k)P(k)VHk) (2.11) 



fc=l 



^_^[l-^(fc)]V(Ar)A(fc). 



fe=l 



The first term at the right-hand side in (2.11) is controlled with the
help of Lemmas 3 and 4 (with $s = 2$) as follows:



E 



Pen(a) 



li.- M= 

' E 



fc=l 




Pen" (a) 


Ca — 2Sq, + 2Sq, 


111 — ft- IK 

II ^ "-all 


111 - Ml 



< 



c 



II - fcr 



l-h ||2\ >. 1/2 
log log ( 1 + £ I l> [EPen" (a)] 



+ 



C 



(2-p)\\l-h c 



log 



1 — h a o || 2 

Pen(a Q ) 



1 + 



Pen(a°) 



[EPen q (a) 



1/9 
1/9 



(2.12) 



To control the last two terms in (|2.1ip . notice that h a {k) = 2h a (k) — 
ha(k), a > 0, is a family of ordered functions. Hence, applying Lemma CD 
with 



b(k) 



Ka 



we have 



Eexp 



Ka 2 



2*^[i-h & (k)]?(k)mVxF) 
k=i 

p 

J2[l-h&{k)] 2 l3 2 (k)\(k) 



k=l 

This inequality and Lemma [2] with 
rj = Pen q (a) 
2 



(2.13) 



< C. 



n 



Ka 2 



fe=l 
yield



E 



2a 



1 - h & \ 



■^[l-h & (k)] 2 ^(k)P(k)VMF) 



k=l 



^-^^^j^li - K{k)} 2 p\k)\(k) 
1 Q " fc=i 



Pen q (a) 



(2.14) 



< 



Ca 2 i 



(2-q)i\\l-h ao \\^ 



log" 



1 + 



Pen(a ) 
Pen(a°) 



EiW(a) 



Finally, combining (2.11), (2.12), and (2.14) and using Jensen's
inequality, we finish the proof. ■

2.3 Proof of Theorem 1

The following proposition (see Lemma 7 in [11]) summarizes some basic
properties of the penalty defined by (1.13)-(1.15).

Proposition 7 



i /, D(a) 1 , D(a) 
Q + («)>^)max^/l g^,-log^i 



fl / D(a) n 

lie > mm< —\ / log — ; — r, — >, 
^ \2y D(a°) '4j' 

^/■D(a) > exp(2)L>(a°) ; toera 

D(a) > /x Q Q + (a) 

For any ai < a 2 



log 



fi a Q + (a) 
D(a°) 



D(ai) Q + (ai) 
D(a 2 ) ~ Q+(a 2 y 

We begin the proof of Theorem [1] with a simple generalization of Propo- 
sition El Consider the following random process 

i£=EA- 1 (kK(k)[{ a (fc)-i] 1 
fe=i 



where h%(k) = [2(1 + e)/i Q (fc) - eh* (fc)]/(2 + e). 
Lemma 6 Let q 6 (1, 2]. Then for any random variable a < a



Cat 



E 



£ \ f 



where 



< = {2 f>- 2 (AO 
fc=i 



1/9 



1/2 



Proof. It is based on the following fact. Let S'(x) = x 1 '^ x S 
Then 



tt<tt° I \ c. 



1/ °a 



< C(t1„ / xSi -)e- Cx2 dx, 



xS[ — |e 

\ z , 

where S~ 1 (x) = x q ~ l denotes the inverse function to S(x). 

To prove this inequality, define ak, k = 0, 1, 2, ... as roots of the following 
equations 

ot. k =o* aa S(l/z)e k . 

Then, noticing that rf a — rf ak is an ordered process, we obtain by (|1 . 12j) and 
Proposition [2] 



p(z)<E sup \rj e a \+J^E 



k=2 



sup 

a k <a<a k -i 



-11 ^-1 



< Ca e a oS(-) f; e fc exp{-C[z5- 1 (S(z- 1 )e fe )] 2 } 

k=0 

f'OO POO 

<Ca% a J exp{-C[zS- 1 (u)] 2 }du = Ca e a o J e -0 * 2 * 2 dS(v) 



S(v)ve- Cz v dv = Col 



.r,S'( — p. 



Cx2 dx. 



Next we get by the Laplace method 



/I \ 1/(9-1) 



exp 



log' 



2(<z-i) g-i 



(2.15) 



22 



To finish the proof, denote for brevity 



E = E 



a 



6 \ q 



Then by (|2. 15[) we obtain with a simple algebra 
Erj e a < mmizEala^S^i^A + J exp 



< CiT^o min<i zi£ + { — ) exp 



lop 



2(g-l) °g-l 



log- 



9 



2(q-l) °q-l 

c 



< 



e „ p}/q 



The following important lemma provides an upper bound for Lq,(/3) + 
(l + 7 )Q+(a). 

Lemma 7 For any data-driven a and any given a £ [a ,a; ], the following 
inequality 

l +7 /4\ 1 /(l+7/4) 



{E[ - 2 L,(/3) + (l + 7 )Q+(a)] 1+ ^ 4 } 



< 



C 



[l-C7*(a ,a°)/7]- 



RM , D(a°) 



+ 



7 1 



ZioWs uniformly in (3 G MP and 7 £ (0, 1/4). 



Proof. In view of the definition of $\hat\alpha$, for any given
smoothing parameter $\alpha$,
$R_{\hat\alpha}[Y, Pen] \le R_\alpha[Y, Pen]$. It is easy to check with
the help of (1.8) that this inequality is equivalent to the following
one



LM + (1 + l)<J 2 Q + {&) -a 2 Y, X-Hk)h & (k)[e 2 (k) - 1] 

k=l 

V 

+ 2aY, A- 1/2 (fc)[l - h & {k)] 2 £'(k)P(k) + [a 2 & - a 2 ]Pen(a) 
k=i 

< LM + (1 + i)a 2 Q + {a) -a 2 f^ A -1 ^)^ (*0K' 2 (*0 ~ 1] 

k=l 

V 

+ 2o- ^ A _1 / 2 (A;)[1 - / ia (A:)] 2 e , (fe)/3(fe) + [<r? - a 2 ]Pen(a), 



(2.16) 



fe=l 
where h a (k) = 2h a (k) — h^(k). We can rewrite (|2.16p as follows:
| [L & (P) + (1 + !)a 2 Q + (a)] < L s (0) + (1 + j)a 2 Q + (a) 



k=l 



+ <x 2 £ A-^MA:)^) " 1] " (l + I ~ £V ( 2 - 17 ) 
k=i ^ ' 

+ 2ajyX- 1 ' 2 (k)[h dl (k) - h a (k)]e(k)P(k) -(l- ^jL & ((3) 



E 



~ o 2 ]+Pen(a) + [a 2 - a 2 & ]+Pen(a). 
Since a is given, we get by Jensen's inequality 

P 1+7/4 f V \ I/2+7/8 

r\-Hk)h(k)[e(k) - 1] < c\^2x-\k)hi(k) 

=l [ ti J (2.18) 

< C[D{a)] 1+lli < C[^ 2 i? s (/3)] 1+7/4 . 
The third line in (|2.17p is bounded by Proposition Q] as follows: 



E 



j2\- l {k)h & (m> 2 (k) - 1] - (i + \ - f)Q 



la 



k=l 

< 



1+7/4 



(2.19) 

CD 1+ T/ 4 (a°) 



where 7 < \j\[2. 

The upper bound for the fourth line in (|2.17p is a little bit more tricky. 
Since {h a (-), a G (0,a°]} is a family of ordered functions, we obtain by 
Proposition [5] that for any e > and given q' > 1/2, 



E 



2a \- 1,2 {k) [h & {k) - hs(k)] ?(k)p(k) 



1 ' (2.20) 



p 



fc=i 



9 

< 



(C7e)9/(V-i) ' 

To continue this inequality, notice that if a > a, then 



M9<i, i^>Mfc) 
and therefore



J2[h*(k)-h & (k)] 



k=i 



\{k) 



k=i 



2 q2 



P 2 {k) 
X(k) 



Sw 



< rnax^£[l - M*)] V(*0 (2.21) 



9 / \ OO 

< max 

s A(sJ 



fc=i 



Similarly, if a < a, then 

00 a2n \ Z2 I \ 00 

- M*0] 2 ^ < max%^£[l - ft-a(A;)] 2 /3 2 (A;) 



fc=i 



A(-) 



fc=i 



< max lM L W^[l_/ l _( A; )] 2 /3 2( fc ). 



AW 



fc=i 



So, combining (|2.2lH2~T2"2"j) with Young's inequality 

yx s - x < (1 - s)s s/(1 - s) y 1/(1 - s) , x,y > 0, s < 1, 

gives 

r oo 

2\-l/ 



fe=i 



<[!--. - 



(2.22) 



1-^)^09) 



fe=l 



-Aa(k)] 7? 2 (A;) + a 2 max 



fc=i 



F" \{k) 



L & (/3) (2.23) 



97(1-9') 
Thus, by QZHD and (gjgjj) we obtain



E 



2aJ2x- 1 / 2 (k)[h & (k) - h s (k)]?{k)0(k) - ( 1- 1)L & (P) 



k=l 



< CB 



2aJ2\~ 1 /\k)[h & (k) - h s (k)]e(k)P(k) 
k=l 
p 



fc=i 



+ CE 



fe=i 



l.-^Ji«C8) 



< (Ce) 57=1 + C e l=7 



h s (k)] y 2 (k) +cj 2 max 



fe=i 



_99 

1-9' 



Therefore, substituting in the above equation q' = 2/3 and 



[1 - fca(*)] P 2 (*0 + tr 2 max 



■fc=i 



h&k) 
V X(k) 



-3q 



we get 



E 



2c7 Xj *~ 1/2 (*0 [fcfi(fc) - h(k)) ?(k)P(k) - (l - L & (J3) 



< C 



1 1 - ha(k)] /3 2 (k) + a 2 max 



hl(k) 



k=l 



k \(k) 



(2.24) 



Now we proceed with the last line in Equation (|2.17p . Since a is given, 
we have by (|2.1ip 

2 r p ~\ 1 / 2 



k=l 



+ 



2a 



\l-h* 



Y,^-ha(k)]y 2 (k)X(k) 



k=l 



1/2 



+ in V112 El 1 - ha(k)} 2 p 2 (k)X(k) 
" a " k=i 

< + ~ ^ llo f][l-^(A:)] 2 /3 2 (fc)A(fc) 



l-/in 



fc=l 
and therefore

\[ \mathbf{E}\{[\hat\sigma^2-\sigma^2]_+\,\mathrm{Pen}(\bar\alpha)\}^{1+\gamma/4} \le C[\sigma^{-2}\Psi(\alpha_0,\alpha^\circ)R_{\bar\alpha}(\beta)]^{1+\gamma/4}. \tag{2.25} \]
The last term in (2.17) can be bounded by Lemma 5 and (1.16) as follows:

E{[a 2 - al] + Pen(a)Y^ 

< C^\a )B{a- 2 L & ((3) + (1 + 7 )Q+(a)} 1+7/4 . 

Finally, combining Equations (2.17), (2.18), (2.19), (2.24), (2.25), (2.26),
we finish the proof. ■

The next idea in the proof of Theorem 1 is that the data-driven parameter
$\hat\alpha$ defined by (1.4) cannot be very small, or equivalently, that the ratio
$D(\hat\alpha)/D(\alpha^\circ)$ cannot be very large.

Lemma 8 For any data-driven $\hat\alpha$ and any given $\bar\alpha \in [\alpha_0, \alpha^\circ]$, the following
upper bound holds





" D(a) ' 






D(a°)_ 


1+7/4 | 







< 



c 



-n 



[l-C#(a ,a°)/ 7 ] 
for any $\gamma \in (0, 1/4)$.
Proof. Representing

\[ (1+\gamma)Q^+(\alpha) = \Bigl(1+\frac{\gamma}{2}\Bigr)Q^+(\alpha) + \frac{\gamma}{2}Q^+(\alpha), \]

we obtain with a simple algebra from (2.17)



(2.27) 



7<7 

~2~ 



-fc=i v 7 



r (*) - 1] 



+ £T SUp 

a<»° 



't^o-'i-f'^-?)^!-: 



-fc=l 



+ 2a^A- 1 / 2 (fc)[/ l(i (A : ) - ^(fc)]C , (fc)/3(fc) - - ^La(/?) 

+ [<7? - a 2 ]Pen(a) + [cr 2 - <r?] + Pen(d). 
Combining this with Equations (2.18), (2.19), (2.24), (2.25), (2.26), we
obtain



E 



£>(a°) 



< 



C 



1+7/4 1 1/(1+7/4) 

CRM 



[l-C*(a ,a )/7]. 



+ 



a 2 7L»(a°) 7 



(2.28) 



To continue this inequality, we need a lower bound for $\|h_\alpha\|^2 + Q^+(\alpha)$.
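Notice that

\[ f(x) = F(x) - \frac{x^2}{1-2x} \]

is a non-negative function for $x \ge 0$ since

\[ f'(x) = \Bigl[\frac12\log(1-2x) + x + \frac{x^2}{1-2x}\Bigr]' = \frac{2x^2}{(1-2x)^2} \ge 0 \quad\text{and}\quad f(0) = 0. \]

Therefore the following inequality holds

\[ F(x) \ge \frac{x^2}{1-2x}. \tag{2.29} \]

Let

\[ k_\alpha = \arg\max_k \frac{h_\alpha(k)}{\lambda(k)}, \]

then by (1.14) and (2.29) we obviously get

\[ \log\frac{D(\alpha)}{D(\alpha^\circ)} \ge F[\mu_\alpha\rho_\alpha(k_\alpha)] \ge \frac{[\mu_\alpha\rho_\alpha(k_\alpha)]^2}{1-2\mu_\alpha\rho_\alpha(k_\alpha)}. \tag{2.30} \]

With this inequality we obtain

\[ \mu_\alpha\rho_\alpha(k_\alpha) \le \Bigl\{1+\Bigl[1+\log^{-1}\frac{D(\alpha)}{D(\alpha^\circ)}\Bigr]^{1/2}\Bigr\}^{-1}, \]

thus arriving at

\[ \mu_\alpha^{-1} \ge 2\rho_\alpha(k_\alpha). \tag{2.31} \]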
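The non-negativity argument for $f$ used above can be checked numerically. The Python sketch below assumes the form $f(x) = \frac12\log(1-2x) + x + x^2/(1-2x)$ on $0 \le x < 1/2$, which is the expression whose derivative appears in the text; the helper names are hypothetical. It compares the closed-form derivative $2x^2/(1-2x)^2$ with a central finite difference.

```python
import math


def f(x: float) -> float:
    """f(x) = (1/2) log(1 - 2x) + x + x^2 / (1 - 2x), for 0 <= x < 1/2 (assumed form)."""
    return 0.5 * math.log(1.0 - 2.0 * x) + x + x * x / (1.0 - 2.0 * x)


def f_prime(x: float) -> float:
    """Closed-form derivative 2 x^2 / (1 - 2x)^2 stated in the text."""
    return 2.0 * x * x / (1.0 - 2.0 * x) ** 2


def central_difference(x: float, h: float = 1e-6) -> float:
    """Numerical derivative of f at x, used only as an independent check."""
    return (f(x + h) - f(x - h)) / (2.0 * h)


if __name__ == "__main__":
    for x in (0.05, 0.2, 0.4):
        print(x, f(x), f_prime(x), central_difference(x))
```

Since $f(0) = 0$ and $f' \ge 0$, the printed values of $f$ are non-negative, which is exactly the statement $F(x) \ge x^2/(1-2x)$.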
Now we are in a position to bound from below $\|h_\alpha\|^2 + Q^+(\alpha)$. By

(fTTMLTol) . (gjSMSP , and ([TT7|I we obtain 



IIMI + Q + («)>IIMa + 



2D(a) 

Ha 



k=l 



2p a p a (k) 

>ii^iiJ+^E^(*)] = ii^iil+^iog^ 



He 



k=l 



> \\h a \\l + 2p a {k a )D{a)\og 



> 



ha{kot) 

A(fc„) 



log 



L>(a° 

£>(a) 
L>(a°) 
D(q) 
L>(a°) 
£>(a) 



(2.32) 



Z)(a) 7i a (fc) , . 

log — j—^- max — — — > CD (a) log ■ 



With the help of (2.32) we continue (2.28) as follows:



E 



D(a) D(a 
log 



D(a° 

< 



D(a°) 
C 



1 + 7/4 N 1/(1+7/4) 



} 



CRM 
a 2 -fD(a°) ' 7 4 



(2.33) 



1 



[l-C*(a ,Q°)/ 7 ]_ 
To control from below the left-hand side in the above equation, notice that 



E 



D(a) , D(d) 
log- 



D(o) D(a) 



1+7/4 



1 



(l + 7 /4) 1 +7/4 



E 



D(a) 
D(a) 



1+7/4 



log 



■D(q) 
D(a) 



1+7/4 N 1+7/4 



(2.34) 



To finish the proof, let us consider the function $f(x) = x\log^{1+\gamma/4}(x)$,
$x \ge 1$. Computing its second order derivative, one can easily check that
$f(x)$ is convex for all $x \ge \exp(1) = e$. So, $f(x+e-1)$ is convex for $x \ge 1$.
It is easily seen that there exists a constant $C > 0$ such that for all $x \ge 1$

\[ f(x) \ge \tfrac12 f(x+e-1) - C. \]
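The convexity claim above can be made explicit. Writing $p = 1+\gamma/4$, a direct computation gives $f''(x) = (p/x)\log^{p-2}(x)\,[\log(x) + p - 1]$, which is positive for $x \ge e$. The Python sketch below checks this formula against a second-order finite difference; the function names and the sample values of $\gamma$ and $x$ are illustrative assumptions.

```python
import math


def f(x: float, gamma: float) -> float:
    """f(x) = x * log(x)**(1 + gamma/4), for x > 1."""
    return x * math.log(x) ** (1.0 + gamma / 4.0)


def f_second(x: float, gamma: float) -> float:
    """Closed-form second derivative (p/x) * log(x)**(p-2) * (log(x) + p - 1), p = 1 + gamma/4."""
    p = 1.0 + gamma / 4.0
    log_x = math.log(x)
    return p * log_x ** (p - 2.0) * (log_x + p - 1.0) / x


def numeric_second(x: float, gamma: float) -> float:
    """Second-order central difference with a step proportional to x."""
    h = 1e-3 * x
    return (f(x + h, gamma) - 2.0 * f(x, gamma) + f(x - h, gamma)) / (h * h)


if __name__ == "__main__":
    for gamma in (0.05, 0.2):
        for x in (math.e, 10.0, 100.0):
            print(gamma, x, f_second(x, gamma), numeric_second(x, gamma))
```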
Therefore by (2.34) and Jensen's inequality,



E 



D(a) , D(&) 
log 



D(a) 



D(a) 



1+7/4 





~D(a)~ 







D(d) 
D(a) 

1+7/4 



1+7/4 



+ e-l 



(2.35) 



+ e- 1 



C. 



Let

\[ \psi(x) = (x+e-1)\log^{1+\gamma/4}(x+e-1). \]

It is easy to check that the inverse function $\psi^{-1}(x)$ satisfies the following
inequality

\[ \psi^{-1}(x) \le (x+e-1)\log^{-1-\gamma/4}(x+e-1). \]
Therefore, combining this equation and (2.35) with (2.33), we arrive at (2.27).



Now we are ready to proceed with the proof of Theorem 1. Let $\epsilon > 0$
be a small given number to be defined later on. By (1.8) and (1.9), the
following equation for the skewed excess risk






£{e) = sup E J ||/3 - Ai|| 2 - (1 + e){R & [Y] + C}\ 

r P p 
= sup EJ -e V[l - h & (k)] 2 p\k) - ea^X-H^hKk) 

-(l + e )(l+ 7 )a 2 Q + (a) 

v 

-2a^{l + e - [(1 + 2e)/» a (fc) - e/^fc)]}/?^- 1 / 2 ^)^) 



(2.36) 



k=i 



-a 2 Y, >T\k) [2(1 + e)h & (k) - ehl(k)} [i' 2 {k) - 1] 

+[(r 2 - &l]Pen(a 



k=l 



holds. 

We proceed with the second line from below at the right-hand side of
this display. By Lemmas 6 and 5 we obtain



a 2 E Y X~Hk) [(1 + e)h & (k) - ehl(k)] [^(k) - 1] 
C 



k=l 



< 



[l-C*(a 0) a°)/ 7 ] +v ^ 



ft 



ga(g) 1 
a 2 -fD(a°) 7 4 



(2.37) 



The next step is to bound the third line from below at the right-hand side
of (2.36). It suffices to note that $h^{\epsilon}_\alpha(k) = [(1+2\epsilon)h_\alpha(k) - \epsilon h_\alpha^2(k)]/(1+\epsilon)$
is a family of ordered functions. Hence, by Proposition 5 we get with
$\bar\alpha = \arg\min_{\alpha\in[\alpha_0,\alpha^\circ]} R_\alpha(\beta)$



2<xE^{l + e - [(1 + 2e)h&(k) - ehl(k)]}/3(k)*~ 1/2 (k)g(k) 



k=l 



2(1 + e)aEJ2[hUk) - h%(k)](3(k)^ 1/2 m'(k) 

1/2 



k=l 



< c 



+ c 



1/2 



L fe=i 

p 

<r 2 maxA -1 (ife)/i!(A;)E^[l - ^(jfe)] 2 /8 2 (*) 
fc fe=i 

v 

< Ct^e^EmaxX-^hKk) + eV[l- h & (k)] V(*0 

k=l 
V 

+ Ca 2 e' 1 max\- 1 {k)hl{k) + eE^[l - h & (k)] 2 p 2 {k). 



(2.38) 



fc=i 



Therefore, substituting (2.25), (2.26), (2.37), (2.38) in (2.36), we obtain
the following upper bound for the skewed excess risk

£(e) < Ca 2 t- l ^ma^\-^[k)h\{k) -a 2 EQ + (a) + C^{a ,a°)Ra(P) 

k 

V 



+Ca 2 e~ 1 max A _1 (fc)/i?(A;) + e^[l — h a (k)] p 2 (k) 



Ca 2 D(a°^ 



[l-C*(a ,a°)/ 7 ]+V7 



fc=i 
ft 



i?a(/3) , }_ 



o- 2 7L>(a°) 7 4 



(2.39) 



Let us consider the function

\[ U(\epsilon) = \max_{\alpha\le\alpha^\circ}\Bigl\{C\epsilon^{-1}\max_k\lambda^{-1}(k)h_\alpha^2(k) - Q^+(\alpha)\Bigr\}. \]



Since 



h 2 a (k) ^ K(k) 

max < max — — < 

k X(k) k X(k) 

P 1,2, 



^A 2 (fc). 
^ D(a) 



^{E^|p-M*)] a } < 



V2 
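and by Proposition 7

\[ Q^+(\alpha) \ge D(\alpha)\sqrt{\log\frac{D(\alpha)}{D(\alpha^\circ)}}, \]

we get

\[ U(\epsilon) \le D(\alpha^\circ)\max_{\alpha\le\alpha^\circ}\Bigl\{\frac{C}{\epsilon}\frac{D(\alpha)}{D(\alpha^\circ)} - \frac{D(\alpha)}{D(\alpha^\circ)}\Bigl[\log\frac{D(\alpha)}{D(\alpha^\circ)}\Bigr]^{1/2}\Bigr\}
\le D(\alpha^\circ)\max_{x\ge1}\Bigl\{\frac{Cx}{\epsilon} - x\sqrt{\log(x)}\Bigr\}. \]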

One can easily check with a simple algebra that 



(Cx 
max< 

X>1 [ € 



xy/log(x) > < 



C 



cxp 



c- 



(2.40) 



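Indeed, let $x^* = \arg\max_x\{Cx/\epsilon - x\sqrt{\log(x)}\}$. Then, differentiating
$Cx/\epsilon - x\sqrt{\log(x)}$ in $x$, we obtain the following equation for $x^*$

\[ \frac{C}{\epsilon} - \sqrt{\log(x^*)} - \frac{1}{2\sqrt{\log(x^*)}} = 0. \]

Therefore

\[ \sqrt{\log(x^*)} = \frac{C}{2\epsilon} + \sqrt{\frac{C^2}{4\epsilon^2} - \frac12}. \]

This equation proves (2.40) since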



^<?i I — + \/t-t - 1 j } £ (, xr>( -v ) 



max 



Cx 



Xyf\og{x) > < 



x>l [ e 
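The maximization behind (2.40) can be explored numerically. The Python sketch below writes $a = C/\epsilon$ and $g(x) = ax - x\sqrt{\log x}$, uses the stationarity condition $a - t - 1/(2t) = 0$ with $t = \sqrt{\log x}$, and compares the resulting value $x^*/(2t^*)$ with a brute-force search on a logarithmic grid. It assumes $a \ge \sqrt2$ so that the stationary point exists; all function names are hypothetical, and the comparison against $g(1) = a$ covers the case where the boundary point wins.

```python
import math


def g(x: float, a: float) -> float:
    """Objective a*x - x*sqrt(log x) on x >= 1, with a playing the role of C/epsilon."""
    return a * x - x * math.sqrt(math.log(x))


def stationary_max(a: float) -> tuple[float, float]:
    """Larger root t of t**2 - a*t + 1/2 = 0, the point x = exp(t**2) and the value x/(2t).
    Requires a >= sqrt(2)."""
    t = 0.5 * (a + math.sqrt(a * a - 2.0))
    x_star = math.exp(t * t)
    return x_star, x_star / (2.0 * t)


def grid_max(a: float, n: int = 200_000) -> float:
    """Brute-force maximum of g over [1, exp(a**2)], which contains x* since t <= a."""
    best = g(1.0, a)
    for i in range(1, n + 1):
        best = max(best, g(math.exp(a * a * i / n), a))
    return best


if __name__ == "__main__":
    a = 3.0  # illustrative value of C/epsilon
    x_star, value = stationary_max(a)
    print("x* =", x_star, "closed-form value =", value, "grid value =", grid_max(a))
```

Because $t^* \ge a/2$ and $x^* \le \exp(a^2)$, the closed-form value is at most $\exp(a^2)/a$, that is, of order $\epsilon\exp(C^2/\epsilon^2)$ up to constants.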

With (2.40) we continue (2.39) as follows:



Cx* 



C 2 

8(e) < Ca 2 D(a )eexp-z- + CV(ao,a )Ra(P) 



+ CahJ^^WhUk) + e£[l - ha(k)] y 2 (k) 



k=l 



+ 



Ca 2 D(a° 



[l-C^(Q ,a°)/ 7 ]+^7 
C 2 



k=l 

K 



a 2 -fD(a°) 7 4 



< Ca 2 D(a°)eexp-5- + Cei? a (/3) + C^(a D , a°)ii«(/3) 
e z 



+ 



C 



[l-C7^(a ,a°)/ 7 ]+V7 



ft 



o- 2 7L>(a°) 7 4 
Therefore, substituting this equation in

E||/3 -ht < (l + e)RM+£(e), 



we get 



E||/3-/3 A || 2 < [1 + C*(a ,a°)]r(P) + Ca 2 D(a°)x 



x inf 



e exp — + 



C 2 er(/3) 



+ 



Ca 2 D(a°) 
e 2 a 2 D(a°)
r(J3) + 1 



(2.41) 



o- 2 7-D(a°) 7 4 



[l-C*(a ,a°)/ 7 ] + V7 
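Hence, to finish the proof of the theorem, it remains to check that

\[ \inf_{\epsilon} F(\epsilon,u) = \inf_{\epsilon}\Bigl\{\epsilon\exp\frac{C^2}{\epsilon^2} + \epsilon u\Bigr\} \le \frac{Cu}{\sqrt{\log(u)}}. \tag{2.42} \]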



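Let $\epsilon^* = \arg\min_{\epsilon} F(\epsilon,u)$. Then, differentiating $F(\epsilon,u)$ in $\epsilon$, we arrive at the
following equation for $\epsilon^*$

\[ \exp\Bigl[\frac{C^2}{\epsilon^{*2}}\Bigr] - \frac{2C^2}{\epsilon^{*2}}\exp\Bigl[\frac{C^2}{\epsilon^{*2}}\Bigr] + u = 0. \]

Thus

\[ \frac{C^2}{\epsilon^{*2}} + \log\Bigl(\frac{2C^2}{\epsilon^{*2}} - 1\Bigr) = \log(u), \]

and it follows immediately from the above equation that

\[ \epsilon^* \le \frac{C}{\sqrt{\log(u)}} \]

and therefore

\[ F(\epsilon^*,u) \le 2u\epsilon^* \le \frac{2Cu}{\sqrt{\log(u)}}, \]

thus proving (2.42).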
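A simple way to see the order of the bound in (2.42) is to plug in $\epsilon_0 = C/\sqrt{\log u}$, for which $\exp(C^2/\epsilon_0^2) = u$ and hence $F(\epsilon_0,u) = 2Cu/\sqrt{\log u}$. The Python sketch below checks this identity and compares it with a crude grid search for the infimum. It assumes $u > 1$; the names `eps_plugin` and `grid_inf`, the grid range, and the sample values of $u$ and $C$ are illustrative only.

```python
import math


def F(eps: float, u: float, C: float = 1.0) -> float:
    """F(eps, u) = eps * exp(C^2 / eps^2) + eps * u."""
    return eps * math.exp(C * C / (eps * eps)) + eps * u


def eps_plugin(u: float, C: float = 1.0) -> float:
    """Plug-in choice eps_0 = C / sqrt(log u), valid for u > 1."""
    return C / math.sqrt(math.log(u))


def grid_inf(u: float, C: float = 1.0, n: int = 2000) -> float:
    """Crude minimum of F over eps in [0.1*C, 2*C] on an equally spaced grid."""
    return min(F(C * (0.1 + 1.9 * i / n), u, C) for i in range(n + 1))


if __name__ == "__main__":
    u, C = 1.0e6, 1.0  # illustrative values
    bound = 2.0 * C * u / math.sqrt(math.log(u))
    print("F at plug-in eps:", F(eps_plugin(u, C), u, C))
    print("2Cu/sqrt(log u): ", bound)
    print("grid infimum:    ", grid_inf(u, C))
```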

Finally, substituting (2.42) with $u = r(\beta)/[\sigma^2 D(\alpha^\circ)]$ in (2.41), we
complete the proof of the theorem.